Detecting Commas in Slovak Legal Texts
Abstract
This paper reports on initial experiments with automatic comma recovery in legal texts. To decide whether to insert a comma between two words, we propose using the probability of the bigram formed by the two words without a comma and the probability of the trigram formed by the same words with a comma between them. Both probabilities come from a language model trained on sentences in which commas are labeled as separate words; in the training database, one sentence corresponds to one line. The thresholds on the bigram and trigram probabilities were determined experimentally to achieve the best balance of precision and recall. The advantage of the proposed method is its high precision (95%) at a relatively satisfactory recall (49%). For judges, as potential users of an ASR system with an automatic comma insertion function, precision is particularly important.
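The decision rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the count-based probability estimates, the way the two thresholds are combined, the threshold values, the function names, and the toy English corpus are all assumptions made for illustration only.

```python
from collections import Counter

COMMA = ","

def train(lines):
    """Count n-grams; one sentence per line, commas already split off as tokens."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for line in lines:
        tokens = line.split()
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
        tri.update(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def p_without_comma(model, w1, w2):
    """Relative frequency of w1 being followed directly by w2 (assumed estimate)."""
    uni, bi, _ = model
    return bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0

def p_with_comma(model, w1, w2):
    """Relative frequency of w1 being followed by ', w2' (assumed estimate)."""
    uni, _, tri = model
    return tri[(w1, COMMA, w2)] / uni[w1] if uni[w1] else 0.0

def insert_commas(model, words, t_bigram=0.2, t_trigram=0.5):
    """Insert a comma between w1 and w2 when the comma trigram is likely enough
    and the comma-free bigram is unlikely enough. Both thresholds are hypothetical."""
    out = []
    for w1, w2 in zip(words, words[1:]):
        out.append(w1)
        if (p_with_comma(model, w1, w2) >= t_trigram
                and p_without_comma(model, w1, w2) <= t_bigram):
            out.append(COMMA)
    out.extend(words[-1:])  # last word (no-op for empty input)
    return out

# Toy training data, invented for this example.
corpus = [
    "the court ruled , that the claim fails",
    "the court found , that the appeal fails",
    "the court ruled on the claim",
]
model = train(corpus)
print(" ".join(insert_commas(model, "the court found that the claim fails".split())))
```

Raising `t_trigram` or lowering `t_bigram` in this sketch makes the rule insert fewer commas, which is the kind of trade-off that favors precision over recall.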
Related resources
Slovak Automatic Dictation System for Judicial Domain
This paper describes the design, development and evaluation of the Slovak dictation system for the judicial domain. The speech is recorded using a close-talk microphone and the dictation system is used for on-line or off-line automatic transcription. The system provides an automatic dictation tool in Slovak for the employees of the Ministry of Justice of the Slovak Republic and all the courts i...
Automatic Text Formatting for Social Media based on Linefeed and Comma Insertion
By appearance of social media, people are coming to be able to transmit information easily on a personal level. However, because users of social media generally spend little time on describing information, low-quality texts are transmitted and it blocks the spread of information. On transmitted texts in social media, commas and linefeeds are inserted incorrectly, and it becomes a factor of low-...
Automatic Comma Insertion for Japanese Text Generation
This paper proposes a method for automatically inserting commas into Japanese texts. In Japanese sentences, commas play an important role in explicitly separating the constituents, such as words and phrases, of a sentence. The method can be used as an elemental technology for natural language generation such as speech recognition and machine translation, or in writing-support tools for non-nati...
The Similarity Detection in Slovak Texts by Compression Method
This paper deals with similarity and plagiarism with a focus on the Slovak texts. It presents and analyzes standard methods and tools used to detect plagiarism in order to use the conclusions of its own solutions. It explains the principle of dictionary method for data compression known as the Lempel-Ziv, which idea of creating the dictionary is used as the basis for our method proposal to dete...
The (re)presentation of the Author in Czech and Slovak Scientific Texts
This paper poses the question of how academic writers present themselves to the audience and focuses on the functions of forms of self-reference in Czech and Slovak scientific discourse. For scientific texts, the Latin rhetoric tradition recommended the so called pluralis modestiae or pluralis auctoris as an appropriate linguistic means of self-presentation of the writer, conveying his modest a...